Ensemble models are machine learning models that use more than one predictor to arrive at a prediction. A group of predictors forms an ensemble. In general, an ensemble performs better than a single predictor (model), improving both efficiency and accuracy.
Ensemble learning can be classified as either bagging or boosting.
Rather than being used as standalone classifiers, decision trees are most commonly utilised to construct bagging and boosting pipelines -- in this case, random forests. Therefore, you will often see use cases of Gradient Boosting/XGBoost and AdaBoost. Extreme Gradient Boosting will be covered in a separate Python Lab (XGBoost on the collection of features around S&P500 price).
Arguably, decision trees have a higher likelihood of accentuating different features (compared to other ML/Supervised Learning techniques). Therefore, different bootstrapped samples will produce very different decision trees -- the regression-model analogy is a coefficient being increased, or reduced to near zero.
Decision trees are also the default choice of base model -- we will see this when creating the Bagging of Logistic Regressor pipeline. Trees produce a richer set of outcomes. Consensus of independent weak learners: random forests can improve on bagging if the correlation between the sampled trees is reduced.
It is also possible to create ensembles of Neural Nets, for which Convolutional Neural Networks are utilised; such ensembles represent state-of-the-art models and use better-than-averaging ensembling.
Model Combination -- ensembles enrich the space of hypotheses considered by the base model class and are representationally richer.
Estimation and Optimization Error Reduction -- ensemble averaging reduces the variance of base models, averaging out variations due to objective function's non-convexity and stochastic learning.
The most common model to predict the probability of default is Logistic Regression (also known as logit). There is a well-developed Maximum Likelihood Estimation theory that supports working with the binary dependent variable: the {0,1} set of survival/default outcomes. We will review the logistic model's MLE in yet another separate Tutorial (Asset price direction prediction with simplified classifier methods).
There is also a special function -- logistic sigmoid -- that converts regression output into a probability figure.
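As a minimal sketch (separate from the Excel implementation; the score values below are made up), the logistic sigmoid maps any linear regression score onto the interval (0, 1):

```python
import numpy as np

def logistic_sigmoid(xb):
    """Map a linear regression score x'beta onto a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-xb))

# A score of 0 corresponds to a 50/50 chance of default
print(logistic_sigmoid(0.0))                         # 0.5
# Strongly negative/positive scores saturate towards 0 and 1
print(logistic_sigmoid(np.array([-5.0, 0.0, 5.0])))
```

The same function is defined again further below, where it is used to plot the calibrated sigmoid against the data.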
When Logistic Regression alone achieves poor prediction of the Default class label {1}, we invoke a standard trick of the data scientist -- bagging the regressor. This is a meta-estimator that reduces the variance of a black-box estimator (eg, a decision tree can be elaborate, making it hard to explain why particular splits were chosen). Randomisation is intentionally introduced into the construction procedure (either a random subset, or randomly breaking the model!), resulting in an ensemble of models.
A bagging regressor fits base regressors, each on a random subset of the original data, and then aggregates their individual predictions (by voting/simple averaging) to form the single final prediction.
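A bootstrap-and-average sketch of this idea, under simplifying assumptions: a synthetic dataset, and a plain least-squares fit standing in for the usual decision-tree base regressor:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

def fit_ols(X, y):
    # Least-squares fit of a simple linear base regressor
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Bagging: fit each base model on a bootstrap resample of the rows,
# then average the individual predictions into one final prediction
n_estimators = 10
betas = []
for _ in range(n_estimators):
    idx = rng.integers(0, len(y), size=len(y))    # sample rows with replacement
    betas.append(fit_ols(X[idx], y[idx]))

y_pred = np.mean([X @ b for b in betas], axis=0)  # simple averaging
print(y_pred[:3])
```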
Tree-based models are a class of ML/supervised learning models used for both classification and regression. In essence, a tree is a series of binary questions. The output of a trained model resembles a tree with branches, nodes and leaves (final nodes) -- each node contains a combination of class labels but is preferred to be 'pure', ie, entropy-minimising.
Overfitting is not difficult to achieve with a decision tree classifier. A single tree gives explanatory benefits when visualised, while a visual synthesis across a Random Forest is out of reach.
Decision trees are very adaptive to the dataset: not many branches are required to zero in on a MinNumInLeaf of 1-2 observations. We will see in our case how, in the Corporate_PD dataset with 72 default instances out of 4,000 annual observations, one decision tree quickly isolated those instances such that 100% accuracy was achieved. When the training error is zero, we can't even begin to make projections about how wrong this tree will behave out-of-sample.
AdaBoost.M1 was the first successful boosting algorithm, developed for binary classification. AdaBoost is a good starting point to study the simple scheme of boosting and the related maths. The large 'however' is in the strategy of AdaBoost-type algorithms: they progressively focus on the observations that yield the largest prediction errors. Focusing on better prediction for the outliers might turn out not to be the best strategy.
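The reweighting scheme can be shown in a minimal sketch of one AdaBoost.M1 round (the five labels and the weak learner's outputs below are made up; labels are coded in {-1, +1}):

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
y_weak = np.array([1, -1, -1, -1, 1])     # the weak learner misclassifies observation 2

w = np.full(5, 0.2)                        # start from uniform observation weights
miss = y_weak != y_true
eps = np.sum(w[miss])                      # weighted error rate = 0.2
alpha = 0.5 * np.log((1 - eps) / eps)      # this learner's say in the final vote

# Up-weight the mistakes, down-weight the hits, renormalise
w = w * np.exp(alpha * np.where(miss, 1.0, -1.0))
w = w / w.sum()
print(np.round(w, 3))   # observation 2's weight rises from 0.2 to 0.5
```

The next weak learner is fitted against these new weights, so it concentrates on the previously misclassified observation -- the 'focus on the largest errors' strategy discussed above.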
CQF delegates can refer to Ensemble Methods lecture (quick introduction of decision tree principles and AdaBoost maths), then Section 3 of DT Theory and Stock Selection - T Guida.ipynb. After that, move on to Section 4 of that notebook as well as Python Lab on Extreme Gradient Boosting.
Sections 1.3 - 1.4 of DT Theory and Stock Selection - T Guida.ipynb discuss how you should analyse and prune an individual tree. The discussion of pruning criteria overlaps with hyperparameter tuning, which we make explicit by exploring the parameter space in this notebook.
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.style.use('fivethirtyeight')
from IPython.display import Image
# Preprocessing & Cross validation
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, RandomizedSearchCV, TimeSeriesSplit, cross_val_score, GridSearchCV
# Metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pandas can read Excel files, relying on the older xlrd or the newer openpyxl Python packages.
Access to databases with credit default information is tricky. We read from Corporate_PD.xlsm, an older but nonetheless indicative implementation of Estimating Default Probability that starts from the Altman Z-score model and extends the estimation to a full Generalised Linear Model (logistic regression).
xl = pd.ExcelFile('data/Corporate_PD.xlsm')
xl.sheet_names
AltmanScores_names = [['WC/TA', 'Working Capital/TA'], ['RE/TA', 'Retained Earnings/TA'], ['EBIT/TA', 'EBIT/TA'], ['ME/TL', 'Market Cap/Total Liabilities'], ['S/TA', 'Sales/TA'], ]
#also ['Const','Intercept'] # scores_altman[0][0]
# usecols stopped working after pandas 0.20.0
# this has been fixed but NOT IN pandas 0.23.0 install
# https://github.com/pandas-dev/pandas/issues/18273
# https://github.com/jacksonjos/pandas/commit/e25710065c8b9291a35289bf3ef97cc62bf966cf
#Y_Response = pd.read_excel(xl, sheet_name="Logit", header=0, usecols=['Default'])
#X_Features = pd.read_excel(xl, sheet_name="Logit", header=0, usecols=['WC/TA', 'RE/TA', 'EBIT/TA', 'ME/TL', 'S/TA'])
#['Const', WC/TA', 'RE/TA', 'EBIT/TA', 'ME/TL', 'S/TA']
X_Features = pd.read_excel(xl, sheet_name="Logit", usecols=[4,5,6,7,8])
X_Features.head()
X_Features.info() #len(X_Features)
The ravel() function converts our Y_Response from a pandas DataFrame column into a numpy array.
Some classifiers 'do not like' a dataframe as input for the single-column dependent variable Y.
There is more: sometimes classifiers do not like a 1-D numpy array as input either, and we need to reshape it into a 2-D matrix.
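A minimal sketch of both conversions, on a tiny hypothetical Default column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Default': [0, 1, 0]})

y = df['Default'].ravel()        # DataFrame column -> flat 1-D numpy array
print(y.shape)                    # (3,)

y2d = np.reshape(y, (-1, 1))      # flat 1-D array -> single-column 2-D matrix
print(y2d.shape)                  # (3, 1)
```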
Y_Response = pd.read_excel(xl, sheet_name='Logit', usecols=[2])
Y_Response.head()
Y_Response = Y_Response['Default'].ravel()
print(Y_Response)
Y_Response_2D = np.reshape(Y_Response, (-1, 1)) # For the fussy OLS computation in sklearn IMPORTANT FEATURE
print(Y_Response_2D)
# Reshaping an array from 1D plain list (above) to 2D is IMPORTANT functionality. Similar to vec2mat() in Matlab
#x = np.reshape(x, (len(x),-1))
Most classifiers are suitable for both binary and multinomial classification: Logistic Regression, SVM, Naive Bayes, Random Forest.
Here we will use a small credit default dataset with certain financial ratios (Altman Z-scores): Corporate_PD.xlsm.
CLASS LABELS -- these are our {0,1} set of survival/default outcomes.
Classification is done for our Default indicator column (Y_Response).
This binary classification has the most developed set of accuracy measures (which are not always available for multinomial classification/deep neural networks).
FEATURES -- these are our financial ratios WC/TA, RE/TA, and others.
Effective visualisation starts with scatterplots and hyperplanes for two chosen features (2D visualisation): one plotted versus another. Functionality such as radviz visualises tension plots between three or more features on a 2D plane and can be informative.
For our corporate defaults, logistic regression computed in Excel by explicit MLE (based on the log of logit function) has the following exact result:
If you need the full logit inference with standard errors and p-values, it is best to use the statsmodels package in Python.
Or compute the Inverse Information $\mathbf{I}^{-1}$ by exact Maximum Likelihood Estimation explicitly! Implemented in Corporate_PD.xls file (Excel) with some VBA and use of Solver.
The regression model above is a GLM (Generalised Linear Model) extending the traditional Altman Z-score model; the variant referred to as probit regression uses a different nonlinear link (the inverse of the Normal CDF).
TAKE AWAYS FROM VARIOUS CLASSIFIERS RUN ON CORPORATE DEFAULT DATA
The most significant feature is Retained Earnings/Total Assets (RE/TA). While the dataset contains plenty of negative retained earnings (annual losses), the default signal given by the fitted sigmoid comes too late: negative Retained Earnings (loss) has to be very large before the model classifies an already obvious case of default.
Chosen for its negative feedback, the Market Equity (Cap)/Total Liabilities (ME/TL) feature is an example of an exogenous variable: large cap is associated with survival.
Confusion matrices help to look into prediction accuracy within each class. As soon as we separate the dataset into train/test halves, Logistic Regression MISCLASSIFIED 39 out of 41 defaults. We also noticed that our Corporate PD.xls data is imbalanced: only 2% of companies defaulted.
Cross-validation proves useless: regardless of how we draw the sample, the mistake in predicting defaults barely affects the total accuracy score, which remains about 98%.
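This 'accuracy paradox' is easy to reproduce with a sketch; the counts below are chosen to resemble the 72-in-4,000 imbalance of our dataset:

```python
import numpy as np

# Hypothetical imbalanced labels: 72 defaults among 4,000 observations
y_true = np.concatenate([np.ones(72), np.zeros(3928)])
y_pred = np.zeros_like(y_true)    # a classifier that never predicts default

accuracy = (y_pred == y_true).mean()
print(round(accuracy, 4))         # 0.982: a high score while catching zero defaults
```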
To tackle the issue, we created a machine learning 'pipeline', feeding the Logistic Classifier object into a Bagging Regressor. Bagging fits the model (often a decision tree) on multiple sub-samples; each competing model makes a prediction, and these predictions are averaged. That allowed us to predict up to half of the defaults correctly.
Visualisation of the Decision Tree regressor revealed how it starts building the tree by splitting within the feature ME/TL. Clustering with Radviz 'spring tension' plots confirmed that small cap companies tend to have a higher rate of default.
It is important to use the Decision Tree Regressor here, because the Decision Tree Classifier creates a very elaborate tree that predicts the data points on which it was trained close to perfectly.
The open question remains whether the fitted Decision Tree will be as effective on OUT OF SAMPLE data. Are we any closer to predicting defaults, more specifically, for the large cap companies that are in trouble?
from sklearn import linear_model
# sklearn Logistic Regression as the classifier. Adding column with intercept value -- not required, already in Excel data.
logit = linear_model.LogisticRegression(C=1e5)
logit.fit(X_Features, Y_Response) ###FITTING DONE HERE
print()
print('Logit beta coefficients for WC/TA, RE/TA, EBIT/TA, ME/TL, S/TA:')
print(logit.coef_)
print()
print('Intercept:')
print(logit.intercept_)
print()
print('model.classes_ =', logit.classes_)
We design our own routine to plot the logistic sigmoid to our specifications.
The input is the viable range of the feature, which for RE/TA lies within $[-7, 5]$.
np.linspace(X_min, X_max, 3000) generates 3000 values incrementally over that range -- those are our input values (x-axis).
logistic_sigmoid is computed with the relevant beta value (for the RE/TA feature) from the logistic regression,
X_Sim * logit.coef_[0,FeatureBetaIdx] + logit.intercept_
An OLS naive regression line is added using the corresponding beta coefficient (for comparison).
Data scatterplots are added at the start, using a different colour for the survival/default class labels {0,1}. The data is simply scattered around the levels of zero and one (horizontally).
ax.scatter(X_Features[FeatureName], Y_Response, c=Y_Response, zorder=20)
def logistic_sigmoid(xb):
return (1 / (1 + np.exp(-xb)))
#AltmanScores_names = [['WC/TA', 'Working Capital/TA'], ['RE/TA', 'Retained Earnings/TA'], ['EBIT/TA', 'EBIT/TA'], ['ME/TL', 'Market Cap/Total Liabilities'], ['S/TA', 'Sales/TA'], ]
#Procedure RELIES X_Features, Y_Response variables to be existing
def logistic_plot(X_min, X_max, FeatureName, FeatureBetaIdx):
plt.clf() #clears the figure drawing space, nothing to do with classifier!
fig, ax = plt.subplots(figsize=(18,10)) #fig = plt.figure(figsize=(18,10))
# 1. Plot two clusters of observations at Y={0,1} on a scatter
ax.scatter(X_Features[FeatureName], Y_Response, c=Y_Response, zorder=20) #X_Features[FeatureName].ravel()
# 2. Plot CALIBRATED sigmoid function (in red) -- with the coefficient from Logistic Regression
X_Sim = np.linspace(X_min, X_max, 3000) #Fill in 3,000 values over the x-axis range
Y_Loss = logistic_sigmoid(X_Sim * logit.coef_[0,FeatureBetaIdx] + logit.intercept_)
ax.plot(X_Sim, Y_Loss, color='red', linewidth=3)
# 3. Below plots OLS line
ols = linear_model.LinearRegression()
ols.fit(X_Features, Y_Response_2D) ### Fitting done here. Fussy OLS computation requires np.reshape(Y_Response, (-1, 1))
ax.plot(X_Sim, ols.coef_[0,FeatureBetaIdx] * X_Sim + ols.intercept_, linewidth=1) #we plot beta*x + intercept (omitting other betas)
ax.axhline(.5, color='.5')
plt.ylabel('Default Indicator 1 or 0', fontsize=22) # also ax.set_ylabel('Default Indicator')
plt.xlabel('wrt Feature: ' + AltmanScores_names[FeatureBetaIdx][1], fontsize=22)
plt.xticks(range(X_min, X_max), fontsize=14) #Axe X range
plt.yticks([0, 1], fontsize=14)
plt.ylim(-.25, 1.25)
plt.xlim(X_min, X_max) #Axe X range
plt.legend(('Logistic Regression', 'Linear Regression'),
loc="lower right", fontsize=14)
#plt.show()
return ax
# plt.subplots() is a function that returns a tuple containing a figure and axes object(s).
# Thus when using fig, ax = plt.subplots() you unpack this tuple into the variables fig and ax.
#type(fig) #<class 'matplotlib.figure.Figure'>
#type(ax) #<class 'matplotlib.axes._subplots.AxesSubplot'>
logistic_plot(-7, 5, 'RE/TA', 1)
FEATURE RE/TA (above): we explore this dimension, which has THE MOST SIGNIFICANT (t-statistic) logit model coefficient.
The Retained Earnings (Loss) ratio has both positive and negative values, which is informative: most defaulting companies have a negative ratio (in any case RE/TA < 1). That is about where the sigmoid goes to 0 ('fires' no default signal). Unfortunately, its default signal is too late (for RE/TA < -3). This classifier needs more data in order to learn to recognise the default.
The economic implication is that, in this dataset, the lack of cash was not the main cause of default (the next feature, Working Capital, is not significant in the logit at all).
For this most significant logit feature (regression variable), the linear model makes some sense because it separates the non-default cases that do not overlap with defaults in the dimension of this ratio, at about $RE/TA > 0.6$.
Linear regression (OLS) 'tries to mimic' the non-linear sigmoid here -- it moves in the right direction but is far from being a good classifier for the task.
logistic_plot(-7, 7, 'WC/TA', 0)
FEATURE WC/TA (above): this dimension is for an INSIGNIFICANT logit model coefficient. The positive sign in front of the coefficient gives us an inverted sigmoid, stretched over a very long range -- this is what insignificance means.
We still see that the classifier (sigmoid) is not calibrated to differentiate between default and non-default.
logistic_plot(-3, 5, 'ME/TL', 3)
FEATURE ME/TL (above): for the LEAST significant regression coefficient. This is typical of an exogenous variable (vs endogenous). Companies with very large market capitalisation default less on their obligations. Defaults are clustered in the small-cap range, but there is no good separation from the non-defaults.
Logistic vs OLS: the linear regression model (OLS) is presented for comparison, but in this case it is clearly 'confused' into a near-horizontal line. It does not cross the cluster of observations -- an artefact of separating the 0s from the 1s.
WHAT MAKES THE DIFFERENCE?
Softmax regression is the name given to the multinomial case of logistic regression.
The default setting for binary classification is multi_class='ovr'. That, however, can be set to multi_class='multinomial'.
A multinomial prediction will inevitably be based on the softmax function: it takes in the multidimensional covariate of features X and maps it to multiple outputs, with a probability prediction for each class. That would be applicable to credit rating situations, or gradations of companies in a near-default state.
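A minimal sketch of the softmax mapping (the three class scores below are hypothetical, e.g. for survival / watch-list / default gradations):

```python
import numpy as np

def softmax(z):
    """Map a vector of class scores onto probabilities that sum to 1."""
    e = np.exp(z - np.max(z))    # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, -1.0]))
print(np.round(p, 3))            # the highest score receives the highest probability
```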
Parameter C is the inverse of regularisation strength. The parameter is similar to the one in Support Vector Machines (described in further sections). Smaller values specify stronger regularisation. Here, the type of regularisation is the L2 penalty.
logit = linear_model.LogisticRegression(C=1e5)
L2 regularisation creates a Ridge-type regression (linear, logit). $C=10^5$ is very large, meaning we apply almost no regularisation; therefore, the $\beta$ values are essentially unchanged compared to the Maximum Likelihood Estimation of the logit.
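A quick illustration of the role of C on synthetic data (the dataset and both C values are arbitrary choices): the large-C fit leaves the betas essentially unregularised, while a small C shrinks them towards zero:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] * 3.0 + rng.normal(size=500) > 0).astype(int)

# C is the INVERSE of regularisation strength: large C -> weak L2 penalty
for C in (1e5, 1e-2):
    beta = LogisticRegression(C=C).fit(X, y).coef_[0]
    print(f'C={C:g}  beta={np.round(beta, 3)}')
```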
The Corporate PD.xls Excel file shows the internals of the canonical GLM (logit, probit) estimation by explicit Maximum Likelihood: Solver is run to vary the betas to maximise the total log-likelihood (computed explicitly for the regression).
As an alternative to MLE, the Logistic Cost Function can be minimised by gradient descent. That approach allows modifying the penalty term, which is typically (and below) the sum of squares of the regression coefficients.
That is called L2 regularisation; the penalty works against regression coefficients being too large, eg, 8 vs 1.2.
where for all regression coefficients $w$ (instead of $\beta$), $\qquad \displaystyle \text{L2}: \lambda \, \|w\|_2^2 = \lambda \, \sum_{j=1}^{N_{coeff}} w_j^2 $.
The larger the value of $\lambda$, the stronger the regularization of the model (weights approach zero).
#Image('../CODE_ML/image/Logistic_L2Regularisation.png', width=500)

3) Without regularisation, the objective is to find the global cost minimum (min SSE). Adding a regularisation penalty changes the objective to minimising the cost subject to a budget constraint, where the budget space for the $w_1, w_2$ parameters is given by the grey circle.
Predicted probabilities are possible to extract from each type of classifier (eg, Logistic, SVM, even Decision Tree). Here, for the Logistic Classifier, $p$ is computed by the logistic function on $\beta' X$ and represents the probability of default -- class label 1 and the second column.
The probability of survival is $1-p$, and given in the first column.
The output resembles a probability transition matrix, but not between states as with Markov chains -- just per observation. If we group observations into, say, credit ratings, then it would be possible to further compute a proper transition matrix.
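A sketch of how the two-column output of predict_proba is assembled (the linear scores below are hypothetical stand-ins for $\beta' X$):

```python
import numpy as np

scores = np.array([-3.0, -1.0, 0.0, 2.0])        # hypothetical beta'X for 4 companies
p_default = 1.0 / (1.0 + np.exp(-scores))        # class label 1: second column
proba = np.column_stack([1.0 - p_default, p_default])
print(np.round(proba, 3))                        # each row sums to 1
```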
# Remember we fitted the Logistic Model (no need to specify an intercept additionaly)
#logit = linear_model.LogisticRegression(C=1e5)
#logit.fit(X_Features, Y_Response)
#clf.predict(X_Features)
logit.predict_proba(X_Features)
y_pred = logit.predict(X_Features)
print(classification_report(Y_Response, y_pred))
# Confusion Matrix
cm = confusion_matrix(Y_Response, y_pred)
temp = pd.DataFrame(cm, index=['0', '1'], columns=['0', '1'])
sns.heatmap(temp, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
There exist Python packages to address imbalanced samples. However, the popular data-scientist choice is to tackle the problems (of the imbalanced sample and unsatisfactory prediction) by improving, and complicating, the estimation method itself -- creating a Bagging Classifier pipeline.
Definition: a bagging regressor is an ensemble meta-estimator that fits base regressors, each on a random subset of the original data, and then aggregates their individual predictions (by voting/averaging) to form the single final prediction.
This meta-estimator reduces the variance of a black-box estimator (eg, a decision tree can be elaborate, making it hard to explain why particular splits were chosen). Randomisation is intentionally introduced into the construction procedure (either a random subset, or randomly breaking the model!), resulting in an ensemble of models.
Further advancement to bagging/boosting (Ensemble methods)
Description of a BAGGING ALGORITHM: take samples from the training data, then construct a model (usually a decision tree) for each data sample. When you need to make a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimate of the true output value (the averaging is done over models that differ structurally).
Here the number of observations on which the Bagging Regressor operates is obviously above 4,000 (the number of companies in our dataset) -- notice the density of the dots.
The default parameter was n_estimators = 10, so 10 alternative logistic models were produced, each of which will predict a different set of companies as defaulted (REF).
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import BaggingClassifier
# Common application over KNeighborsClassifier()
# from sklearn.neighbors import KNeighborsClassifier
#bagging_classKNN = BaggingClassifier(KNeighborsClassifier(),
#                                     max_samples=0.5, max_features=0.5)
#bagging_classSVN = BaggingClassifier(SVC(C=C, probability=True),
# max_samples=0.5, max_features=0.5)
#We are going to use Logistic Classifier. Below were already executed so 'logit' variable exists
#logit = linear_model.LogisticRegression(C=1e5)
#logit.fit(X_Features, Y_Response)
bagging_reg_logit = BaggingRegressor(logit)
bagging_reg_logit.fit(X_Features, Y_Response)
bagging_pred_logit = bagging_reg_logit.predict(X_Features) #this is just an array with predictions
bagging_pred_logit
print()
print('Returns the coefficient of determination R^2 of the prediction:')
print(bagging_reg_logit.score(X_Features, Y_Response, sample_weight=None))
This is a very poor R^2, but see below how we utilise and interpret the output of bagging.
#print()
#print('The collection of fitted sub-estimators')
#print(bagging_reg_logit.estimators_)
#Returns a dynamically generated list of indices identifying the samples used for fitting each member of the ensemble, i.e., the in-bag samples.
#Note: the list is re-created at each call to the property in order to reduce the object memory footprint by not storing the sampling data.
print('The subset of drawn samples for each base estimator (Logistic Classifier)')
print(bagging_reg_logit.estimators_samples_)
print()
print(bagging_reg_logit.estimators_features_)
#print('Score of the training dataset obtained using an out-of-bag estimate')
#print(bagging_reg_logit.oob_score_) #NOT IN THIS VERSION
#Quick scatter to check
#plt.scatter(X_Features['RE/TA'], bagging_pred_logit, c=bagging_pred_logit, zorder=20)
#Full scatter plot with sigmoid
plt.figure(figsize=(18,10))
plt.scatter(X_Features['RE/TA'], bagging_pred_logit, c=bagging_pred_logit, zorder=20) #X_Features[FeatureName].ravel()
plt.colorbar()
plt.ylabel('Default Likelihood: Bagged Logistic Classifier', fontsize=20) # also ax.set_ylabel('Default Indicator')
plt.xlabel('wrt Feature: ' + 'RE/TA', fontsize=20)
X_Sim = np.linspace(-4, 2, 1500) #Fill in 1,500 values over the x-axis range
Y_Loss = logistic_sigmoid(X_Sim * logit.coef_[0,1] + logit.intercept_) # Y_Loss = logistic_sigmoid(X_Sim * logit.coef_[0,FeatureBetaIdx] + logit.intercept_).ravel()
plt.plot(X_Sim, Y_Loss, color='red', linewidth=1)
plt.show()
INTERPRETATION
The default number of estimators (alternative logistic regressions) is 10 (n_estimators : int, optional (default=10)). On the plot above: all ten models predicted the same two companies (top dots) to default; nine models predicted one and the same company to default.
We filtered out those three very clear defaults, for which Retained Earnings/Total Assets ratio $<-1.8$.
It is also agreeable that for $\text{RE/TA} > 0.3$, there are no defaults.
Conclusion: depending on how the threshold $\text{L} > 0.2$ is set using the gradient scale on the right, 12 or more defaults can be predicted, compared to 2. That is a sixfold increase in detected defaults, compared with the non-workable k-fold cross-validation.
Exact results WILL DIFFER ON EACH RUN.
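The thresholding step above can be sketched as follows; the averaged predictions are hypothetical stand-ins for the fitted bagging_pred_logit array:

```python
import numpy as np

# Each value is the average over the 10 base models, i.e. roughly the
# fraction of models 'voting' default for that company (hypothetical values)
bagging_pred = np.array([0.0, 0.1, 0.3, 1.0, 0.9, 0.0, 0.2, 0.6])

threshold = 0.2
predicted_default = bagging_pred > threshold
print(int(predicted_default.sum()))   # number of companies flagged as defaults
```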
bagging_reg = BaggingRegressor()
bagging_reg.fit(X_Features, Y_Response)
bagging_pred = bagging_reg.predict(X_Features)
plt.figure(figsize=(18,10))
plt.scatter(X_Features['RE/TA'], bagging_pred, c=bagging_pred, zorder=20) #X_Features[FeatureName].ravel()
plt.ylabel('Default Likelihood: Bagged Alternative Classifier', fontsize=20) # also ax.set_ylabel('Default Indicator')
plt.xlabel('wrt Feature: ' + 'RE/TA', fontsize=20)
plt.show()
On the plot ABOVE we see default predictions from the random forest (10 decision trees) -- the yellow points.
Those 10 decision trees are perhaps more different than necessary, but each is accurate in its unique way. Combining their predictions results in a better estimate of the true value.
A clever bit of bagging: the decision trees are created so that, rather than selecting optimal split points, suboptimal splits are made by introducing randomness. That should improve out-of-sample predictions. (Otherwise, given a large enough tree, it will fit the sample all too perfectly.)
TAKE AWAY: if you get good classification or clustering results with an algorithm that has inherently high variance (eg, Decision Tree), you can often get better results by bagging that algorithm.
Below is the continuation of our bagging of an algorithm, now with the default base estimator being the Decision Tree. This is the more common combination (bagging of trees, ie, close to a Random Forest).
#AltmanScores_names = [['WC/TA', 'Working Capital/TA'], ['RE/TA', 'Retained Earnings/TA'], ['EBIT/TA', 'EBIT/TA'], ['ME/TL', 'Market Cap/Total Liabilities'], ['S/TA', 'Sales/TA'], ]
#
#Procedure not independent, relies on X_Features, Y_Response
def bagging_scatterX2(FeatureName, FeatureBetaIdx, comp_estimator):
plt.clf() #to clean the canvas
fig = plt.figure(figsize=(18,10))
# 1) Bagging estimator using Logit as base
bagging_reg_logit = BaggingRegressor(logit)
bagging_reg_logit.fit(X_Features, Y_Response)
bagging_pred_logit = bagging_reg_logit.predict(X_Features)
plt.subplot(221)
plt.scatter(X_Features[FeatureName], bagging_pred_logit, c=bagging_pred_logit, zorder=20)
plt.colorbar()
plt.ylabel('Bagged Logistic Classifier', fontsize=12)
plt.xlabel('wrt Feature: ' + FeatureName, fontsize=12)
# Add CALIBRATED sigmoid function (in red) -- with the coefficient from Logistic Regression
X_Sim = np.linspace(-4, 2, 1500) #Fill in 1,500 values over the x-axis range
Y_Loss = logistic_sigmoid(X_Sim * logit.coef_[0,FeatureBetaIdx] + logit.intercept_)
plt.plot(X_Sim, Y_Loss, color='red', linewidth=1)
# 2) Bagging estimator using AN ALTERNATIVE as base, no input = decision tree
bagging_reg = BaggingRegressor(comp_estimator)
bagging_reg.fit(X_Features, Y_Response)
bagging_pred = bagging_reg.predict(X_Features)
plt.subplot(222)
plt.scatter(X_Features[FeatureName], bagging_pred, c=bagging_pred, zorder=20)
plt.colorbar()
plt.ylabel('Bagged ALTERNATIVE Classifier', fontsize=12)
plt.xlabel('wrt Feature: ' + FeatureName, fontsize=12)
#plt.show()
return
bagging_scatterX2('RE/TA', 1, None) #most significant feature
bagging_scatterX2('WC/TA', 0, None) #insignificant feature in logit
bagging_scatterX2('ME/TL', 3, None) #least significant feature
The Decision Tree algorithm minimises entropy, or 'seeks purity': the algorithm searches for a split that will lead to clusters that are as pure as possible, ie, with one very dominant class. The search is done according to a defined Loss Function, which is typically the Gini impurity index.
All of these loss functions can be expressed in terms of the class proportions at a node: if there are J classes, we denote these proportions by $p_j$.
the Gini impurity index: $ 1-\sum_{j=1}^Jp_j^2; $
the misclassification error: $ 1-\underset{j}{\text{max}}\, p_j; $
entropy: $ -\sum_{j=1}^J\log(p_j)p_j. $
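These three measures are easy to compute directly from the node proportions; a small sketch:

```python
import numpy as np

def gini(p):
    return 1.0 - np.sum(p ** 2)            # 1 - sum_j p_j^2

def misclassification(p):
    return 1.0 - np.max(p)                 # 1 - max_j p_j

def entropy(p):
    p = p[p > 0]                           # terms with p_j = 0 contribute nothing
    return -np.sum(p * np.log(p))

node = np.array([0.9, 0.1])                # a fairly 'pure' node: 90% one class
print(round(float(gini(node)), 4))               # 0.18
print(round(float(misclassification(node)), 4))  # 0.1
print(round(float(entropy(node)), 4))            # 0.3251

pure = np.array([1.0, 0.0])                # a perfectly pure node: all measures are 0
print(gini(pure), misclassification(pure), entropy(pure))
```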
Initially we will run decision trees on a dataset of TWO FEATURES only, which we have learned to be the most explanatory. These features coincide with our Restricted model M0 from the Corporate PD.xls step-by-step implementation of the logistic classifier in Excel.
Why two features only? You will quickly understand how an unbounded, unpruned Decision Tree Classifier branches out into a very elaborate tree, which classifies nearly every observation with a separate, dedicated leaf! In fact, decision trees are adaptive to the sample: not many branches are required to zero in on a MinNumInLeaf of 1-2 observations.
In this demonstration, we will make the structure of the decision tree explicit. But the extraction of individual trees, visualisation and synthesis across a Random Forest are out of reach.
Classifier vs Regressor
The relative advantage of a regressor, in the context of bagging and boosting for a binary variable, is that it allows us to observe values in between 0 and 1. That allows a loose interpretation as a probability.
The inherent task is classification because the dependent variable is categorical (here binary). Therefore, Decision Tree Classifiers will naturally isolate class labels more efficiently and report high accuracy scores > 0.95. As with all classifiers, the confusion matrix should be examined to see the in-class prediction. This is more challenging to do with regression output.
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
import graphviz
from sklearn.tree import export_graphviz
from IPython.display import Image
# REPEAT OF Data Loading
AltmanScores_names = [['WC/TA', 'Working Capital/TA'], ['RE/TA', 'Retained Earnings/TA'], ['EBIT/TA', 'EBIT/TA'], ['ME/TL', 'Market Cap/Total Liabilities'], ['S/TA', 'Sales/TA'], ]
#also ['Const','Intercept'] # scores_altman[0][0]
X_Features = pd.read_excel(xl, sheet_name="Logit", usecols=[4,5,6,7,8])
X_Features.head()
Y_Response = pd.read_excel(xl, sheet_name='Logit', usecols=[2])
Y_Response = Y_Response['Default'].ravel()
X_Features_2D = X_Features.loc[:,['RE/TA', 'ME/TL']]
rng = np.random.RandomState(1)
regr_1 = DecisionTreeRegressor(max_depth=4)
regr_1.fit(X_Features_2D, Y_Response)
regr_2 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
n_estimators=300, random_state=rng)
regr_2.fit(X_Features_2D, Y_Response)
# PREDICTION
y_1 = regr_1.predict(X_Features_2D)
y_2 = regr_2.predict(X_Features_2D)
y_1[1:100]
y_2[1:100]
Its in-sample prediction accuracy is near 100%. When the training error is zero, we cannot compute any estimates of the extra-sample error (think "0.632 Estimator") -- exactly when the model is surely overfitted and the extra-sample error is going to be at its largest.
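A train/test split gives the honest error estimate that the training score cannot; a sketch on synthetic imbalanced data (the dataset below is made up, standing in for our two features):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced sample: few 'defaults', with label noise the tree cannot see
rng = np.random.default_rng(1)
X = rng.normal(size=(4000, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=4000) < -2.1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

deep = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)   # unbounded tree
print(f'train accuracy: {deep.score(X_tr, y_tr):.3f}')          # a perfect in-sample fit
print(f'test  accuracy: {deep.score(X_te, y_te):.3f}')          # the honest estimate
```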
dot_data = export_graphviz(regr_1, out_file=None,
feature_names= ['RE/TA', 'ME/TL'],
class_names=['survival', 'default'], filled=True,
rounded=True,
proportion=True)
graph = graphviz.Source(dot_data, filename='../CODE_ML_DT/image/TreeREGRESSOR', format='png')
graph.render(); #saved graph as an image in target directory
Image('../CODE_ML_DT/image/TreeREGRESSOR_COPY.png')
Problem 1. For MacOS, a simple install with conda should resolve the issues; below are the Terminal commands. Conda installs appear to be in addition to pip installs (and are installed into a separate directory /python 3.7/site-packages). Conda environment location: /Users/YOURUSERNAME/opt/anaconda3
conda install graphviz
conda install -c conda-forge pydotplus
RuntimeError/Graphviz executables - reinstall
RuntimeError/Graphviz executables - edit PATH
Problem 2. It is only possible to visualise one tree at a time (not an ensemble of trees / Random Forest). To make things more difficult, there is no easy way to graph the "best" tree. Still, the sample code below gives an idea of how to pick one tree from the forest.
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
import pydot
# assumes an existing X_train/y_train split
forest_clf = RandomForestClassifier()
forest_clf.fit(X_train, y_train)
tree.export_graphviz(forest_clf.estimators_[0], out_file='tree_from_forest.dot')
(graph,) = pydot.graph_from_dot_file('tree_from_forest.dot')
graph.write_png('tree_from_forest.png')
sklearn.tree.export_graphviz(decisiontree, ...) exports one decision tree. If you attempt to pass a fitted ensemble (Random Forest, AdaBoostRegressor, AdaBoostClassifier) into this function, the export for graphviz visualisation will fail.
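As a hedged sketch of "picking one tree", the self-contained snippet below (on synthetic data) scores every member of a fitted forest on the training set and exports the highest scorer -- a crude but workable notion of the "best" tree.

```python
# Sketch: export the member tree of a forest that scores best on the training data
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Synthetic stand-in data; in the Lab this would be the fitted forest above
X_demo, y_demo = make_classification(n_samples=200, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Score each member tree on the training set and keep the top scorer
best_tree = max(forest.estimators_, key=lambda t: t.score(X_demo, y_demo))
dot = export_graphviz(best_tree, out_file=None)
print(dot.splitlines()[0])   # a DOT 'digraph Tree {' source string
```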
Below are experiments with Decision Tree Classifiers using the Gini impurity index and entropy as splitting criteria. Observe that the Gini results are identical to those of the Decision Tree Regressor, which reports MSE as its loss function.
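The similarity of the Gini and entropy trees is less surprising once the two criteria are written out: for a binary node with class-1 probability p, both are smooth, symmetric impurity measures, maximised at p = 0.5 and zero for pure nodes. A minimal sketch:

```python
# Gini impurity and entropy as functions of the class-1 probability p
import numpy as np

def gini(p):
    # 1 - p^2 - (1-p)^2 simplifies to 2p(1-p) for two classes
    return 2 * p * (1 - p)

def entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

print(gini(0.5), entropy(0.5))   # both peak at p = 0.5: 0.5 and 1.0
print(gini(0.0))                 # pure node: impurity is zero
```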
clf = DecisionTreeClassifier(random_state=rng, max_depth=4) #bounded classifier - decision tree pruned
clf.fit(X_Features_2D, Y_Response)
print(f'Tree Regressor Accuracy: {regr_1.score(X_Features_2D, Y_Response):0.4}')
print(f'Tree Classifier Accuracy (Gini impurity): {clf.score(X_Features_2D, Y_Response):0.4}')
dot_data = export_graphviz(clf, out_file=None,
feature_names= ['RE/TA', 'ME/TL'],
class_names=['survival', 'default'], filled=True,
rounded=True,
proportion=True)
graph = graphviz.Source(dot_data, filename='../CODE_ML_DT/image/TreeCLASSIFIER_GINI', format='png')
graph.render();
Image('../CODE_ML_DT/image/TreeCLASSIFIER_GINI_COPY.png')
clf = DecisionTreeClassifier(random_state=rng, max_depth=4, criterion='entropy') #bounded classifier - decision tree pruned
clf.fit(X_Features_2D, Y_Response)
print(f'Tree Regressor Accuracy: {regr_1.score(X_Features_2D, Y_Response):0.4}')
print(f'Tree Classifier Accuracy (Entropy): {clf.score(X_Features_2D, Y_Response):0.4}')
dot_data = export_graphviz(clf, out_file=None,
feature_names= ['RE/TA', 'ME/TL'],
class_names=['survival', 'default'], filled=True,
rounded=True,
proportion=True)
graph = graphviz.Source(dot_data, filename='../CODE_ML_DT/image/TreeCLASSIFIER_ENTROPY', format='png')
graph.render();
Image('../CODE_ML_DT/image/TreeCLASSIFIER_ENTROPY_COPY.png')
clf = DecisionTreeClassifier(random_state=rng) #unlimited max_depth classifier
clf.fit(X_Features_2D, Y_Response)
dot_data = export_graphviz(clf, out_file=None,
feature_names= ['RE/TA', 'ME/TL'],
class_names=['survival', 'default'], filled=True,
rounded=True,
proportion=True)
graph = graphviz.Source(dot_data, filename='../CODE_ML_DT/image/TreeCLASSIFIER_NOPRUNING', format='png')
graph.render();
Image('../CODE_ML_DT/image/TreeCLASSIFIER_NOPRUNING_COPY.png')
# Predict
y_3 = clf.predict(X_Features_2D) # below are predictions from the unpruned Classifier with zero in-sample error.
y_3[1:100]
y_3[3950:4000] # about 37 default cases are at the end of the dataset of 4,000 observations
The unbounded, unpruned Decision Tree Classifier returned predictions that coincide exactly with the survival 0 / default 1 labels in the original data.
That was not very difficult for the decision tree method to achieve on this Corporate_PD dataset. After all, it is very imbalanced: among 4,000 annual observations (830 companies, 1999-2004), only 72 default instances (observations) occurred within the period.
This is overfitting at its finest, and it is easy to overlook. With zero training error, we cannot even begin to estimate how badly this tree will behave out-of-sample. In this context, we skip the Train/Test split exercise (no need) because we go straight to Boosted Decision Trees, which are estimated on different subsamples (up to 300).
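The imbalance alone explains the deceptively high accuracy. A minimal sketch (synthetic counts chosen to mirror the 72-defaults-in-4,000 proportion) shows that a dummy classifier which always predicts "survival" already scores about 98%:

```python
# The accuracy paradox: on a ~2% positive sample, always predicting the
# majority class already yields ~98% accuracy without catching any default
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

y_imb = np.array([1] * 72 + [0] * (4000 - 72))   # 72 defaults out of 4,000
X_dummy = np.zeros((4000, 1))                    # features are irrelevant here

majority = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_imb)
acc = accuracy_score(y_imb, majority.predict(X_dummy))
print(f"{acc:.4f}")   # 0.9820 without predicting a single default
```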
We will hit another issue with our chosen method of boosting trees (AdaBoost), and therefore recommend XGBClassifier with properly scaled data as the next step of improvement (for CQF delegates -- see the relevant Python Lab on eXtreme Gradient Boosting).
print(f'Tree Regressor Accuracy (Untuned) : {regr_1.score(X_Features_2D, Y_Response):0.4}')
print(f'Tree Regressor Accuracy AdaBoost: {regr_2.score(X_Features_2D, Y_Response):0.4}')
print(f'Tree Classifer Accuracy (Unpruned): {clf.score(X_Features_2D, Y_Response):0.4}')
# Confusion Matrix for Unpruned Decision Tree Classifier (100% in-sample accuracy)
cm = confusion_matrix(Y_Response, y_3)
temp = pd.DataFrame(cm, index=['0', '1'], columns=['0', '1'])
sns.heatmap(temp, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
For any classifier (Decision Tree, Logistic, SVM, etc.), the common, automated search for the best hyperparameters is done using GridSearchCV or RandomizedSearchCV. Both implement fit and score methods and rely on a cross-validated search over the parameter space.
RandomizedSearchCV is more appropriate when running complicated pipelines of boosted decision trees (eg XGBoost) because a fixed number of parameter settings is sampled from the specified distributions. Not all parameter values are tried, as they are with GridSearchCV, and that saves time.
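A minimal RandomizedSearchCV sketch over the same three tree hyperparameters (their sklearn names are min_samples_split, min_samples_leaf, max_depth); synthetic data stands in for the Corporate PD features here:

```python
# RandomizedSearchCV sketch: sample 20 settings from the distributions
# instead of trying every grid point as GridSearchCV would
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, random_state=0)
param_dist = {
    "min_samples_split": randint(2, 40),   # MinNum2Split
    "min_samples_leaf": randint(1, 20),    # MinNumInLeaf
    "max_depth": randint(1, 11),           # MaxDepth
}
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                            param_dist, n_iter=20, cv=5, random_state=0)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```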
However, for our end purpose of plotting surfaces that relate the key hyperparameters of a decision tree (MinNum2Split, MinNumInLeaf, MaxDepth), it was necessary to code the iteration manually and save the model-space results.
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8') # use 'seaborn' on matplotlib < 3.6
import matplotlib.pyplot as pyplot
from mpl_toolkits import mplot3d
### OMIT CODE FROM RELEASE ###
### OMIT CODE FROM RELEASE ###
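The search code itself is omitted from the release. Purely as a hypothetical reconstruction (the real implementation may differ), a manual grid iteration matching the checkHyperparamsDT calls and column names used below could look like this:

```python
# Hypothetical sketch of a manual hyperparameter grid search for a decision tree;
# returns one row per (MinNum2Split, MinNumInLeaf, MaxDepth) combination
import itertools
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def checkHyperparamsDT(X, y, max_depth_range=range(1, 11)):
    """Manual grid search; saves test accuracy for each parameter combination."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    rows = []
    for split, leaf, depth in itertools.product(range(2, 21, 2),
                                                range(1, 11),
                                                max_depth_range):
        clf = DecisionTreeClassifier(min_samples_split=split,
                                     min_samples_leaf=leaf,
                                     max_depth=depth, random_state=0)
        clf.fit(X_tr, y_tr)
        rows.append({'MinNum2Split': split, 'MinNumInLeaf': leaf,
                     'MaxDepth': depth, 'TestAccuracy': clf.score(X_te, y_te)})
    return pd.DataFrame(rows)

# Demo on synthetic data (the Lab calls this on X_Features_2D, Y_Response)
from sklearn.datasets import make_classification
X_demo, y_demo = make_classification(n_samples=200, random_state=0)
space = checkHyperparamsDT(X_demo, y_demo, max_depth_range=range(1, 4))
print(space.loc[space['TestAccuracy'].idxmax()])
```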
3D plots exploring the hyperparameter search are presented below.
First, results have been saved for the Decision Tree Classifier -- as can be seen from the relatively high accuracy range (0.93, 0.955). maxDepth was allowed to vary from 1 to 10 as well, but 4 remained the optimal choice.
accuracy_space = checkHyperparamsDT(X_Features_2D, Y_Response)
accuracy_space.loc[accuracy_space['TestAccuracy'].idxmax(),:].to_frame().T
fig = pyplot.figure()
ax = fig.add_subplot(projection='3d') # fig.gca(projection='3d') was removed in matplotlib 3.6
clf_surf1=ax.plot_trisurf(accuracy_space['MinNum2Split'], accuracy_space['MinNumInLeaf'], accuracy_space['TestAccuracy'], cmap=plt.cm.viridis, linewidth=0.2)
fig.colorbar(clf_surf1, shrink=0.5, aspect=5)
ax.set_zlim(0.93, 0.955) #MANUAL ADJUSTMENT -- tight values since Classifier predicts too well
plt.xlabel('Min to Split')
plt.ylabel('Min in Leaf')
plt.savefig('image/clf_surf1.png')
plt.show()
fig = pyplot.figure()
ax = fig.add_subplot(projection='3d')
clf_surf2 =ax.plot_trisurf(accuracy_space['MaxDepth'], accuracy_space['MinNumInLeaf'], accuracy_space['TestAccuracy'], cmap=plt.cm.viridis, linewidth=0.2)
fig.colorbar(clf_surf2, shrink=0.5, aspect=5)
ax.set_zlim(0.93, 0.955) #MANUAL ADJUSTMENT -- tight values since Classifier predicts too well
plt.xlabel('Min in Leaf')
plt.ylabel('Maximum Depth')
plt.savefig('image/clf_surf2.png')
plt.show()
Second, the code above was altered to run the Decision Tree Regressor (the Classifier line commented out), and the results below are saved for the Decision Tree Regressor.
maxDepth was allowed to range from 1 to 10 -- even though this leads to unmanageable trees and does not improve accuracy towards 100% correct in-sample prediction. The viable score range on test data for a tree regressor (its .score() reports R-squared) is (0, 0.20).
accuracy_space = checkHyperparamsDT(X_Features_2D, Y_Response)
accuracy_space.loc[accuracy_space['TestAccuracy'].idxmax(),:].to_frame().T
fig = pyplot.figure()
ax = fig.add_subplot(projection='3d')
reg_surf1 = ax.plot_trisurf(accuracy_space['MinNum2Split'], accuracy_space['MinNumInLeaf'], accuracy_space['TestAccuracy'], cmap=plt.cm.viridis, linewidth=0.2)
fig.colorbar(reg_surf1, shrink=0.5, aspect=5)
ax.set_zlim(0, 0.20) #MANUAL ADJUSTMENT -- the Regressor's scores stay in a low range
plt.xlabel('Min to Split')
plt.ylabel('Min in Leaf')
plt.savefig('image/reg_surf1.png') #distinct name so the Classifier plot is not overwritten
plt.show()
fig = pyplot.figure()
ax = fig.add_subplot(projection='3d')
reg_surf2 = ax.plot_trisurf(accuracy_space['MaxDepth'], accuracy_space['MinNumInLeaf'], accuracy_space['TestAccuracy'], cmap=plt.cm.viridis, linewidth=0.2)
fig.colorbar(reg_surf2, shrink=0.5, aspect=5)
ax.set_zlim(0, 0.20) #MANUAL ADJUSTMENT -- the Regressor's scores stay in a low range
plt.xlabel('Min in Leaf')
plt.ylabel('Maximum Depth')
plt.savefig('image/reg_surf2.png') #distinct name so the Classifier plot is not overwritten
plt.show()
AdaBoost.M1 was the first successful boosting algorithm developed for binary classification. AdaBoost is a good starting point for studying the simple scheme of boosting and the related maths. For a quick introduction, CQF delegates can refer to the Ensemble Methods lecture and Section 3 of DT Theory and Stock Selection - T Guida.ipynb.
The large `however' lies in the strategy of AdaBoost-type algorithms to progressively focus on the observations that yield the largest prediction errors. Focusing on better prediction for the outliers might turn out not to be the best strategy.
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.
First, a model is fitted on the training data; then
each subsequent model attempts to correct the errors of the first (prior) model.
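The reweighting scheme behind those two steps can be sketched by hand. The following is a minimal discrete AdaBoost with decision stumps (labels in {-1, +1}), not the sklearn implementation: each round fits a stump on the current sample weights, then up-weights the misclassified points so the next stump concentrates on the previous round's errors.

```python
# Hand-rolled discrete AdaBoost sketch (labels must be in {-1, +1})
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sketch(X, y, n_rounds=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start from uniform weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - err) / err)    # stump's vote in the ensemble
        w *= np.exp(-alpha * y * pred)           # up-weight mistakes, down-weight hits
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return lambda X_new: np.sign(
        sum(a * s.predict(X_new) for a, s in zip(alphas, stumps)))

# Toy check: single stumps are weak on a diagonal boundary, boosting recovers it
rng_demo = np.random.RandomState(0)
X_toy = rng_demo.randn(300, 2)
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, 1, -1)
boosted = adaboost_sketch(X_toy, y_toy)
print((boosted(X_toy) == y_toy).mean())
```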
plt.clf() # clear the current figure
plt.figure(figsize=(12, 10))
plt.scatter(X_Features_2D['RE/TA'], Y_Response, c="k", label="training samples")
plt.scatter(X_Features_2D['RE/TA'], y_1, c="g", label="n_estimators=1", linewidth=2)
plt.scatter(X_Features_2D['RE/TA'], y_2, c="r", label="n_estimators=300", linewidth=2)
plt.xlabel("for feature: Retained Earnings/Total Assets", fontsize=14)
plt.ylabel("target", fontsize=14)
plt.title("TWO MODELS (Decision Tree - green, Boosted - red): Predictions for Corporate Default", fontsize=14) #plt.colorbar() ADD ON NEXT RUN
plt.legend()
plt.show()
#xlabel= 'wrt Feature: ' + AltmanScores_names[FeatureBetaIdx][1]
#plt.xlabel(xlabel)
NOTE Two models are mixed on scatterplots such as these. In fact there are three datasets: ONE ACTUAL and TWO PREDICTED from our models:
regr_1 = DecisionTreeRegressor(max_depth=4)
regr_2 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
n_estimators=300, random_state=rng)
The scattered output reveals the extent of the search performed for our binary classification of default/non-default in the Corporate PD dataset.
AdaBoost fits each successive tree on a reweighted copy of the dataset, and the scatter shows the variety of estimates this ensemble of trees produces.
plt.clf() #to clean the canvas
fig = plt.figure(figsize=(18,10))
plt.subplot(221)
plt.scatter(X_Features_2D['RE/TA'], Y_Response, c="k", label="training samples(data)")
plt.scatter(X_Features_2D['RE/TA'], y_1, c="g", label="n_estimators=1", linewidth=2)
#plt.ylabel('Prediction target', fontsize=12)
plt.xlabel('Retained Earnings/Total Assets', fontsize=14)
plt.title('Decision Tree Regression', fontsize=14)
plt.subplot(222)
plt.scatter(X_Features_2D['RE/TA'], Y_Response, c="k", label="training samples(data)")
plt.scatter(X_Features_2D['RE/TA'], y_2, c="r", label="n_estimators=300", linewidth=2)
#plt.ylabel('Prediction target', fontsize=12)
plt.xlabel('Retained Earnings/Total Assets', fontsize=14)
plt.title('BOOSTED Decision Tree Regression', fontsize=14)
plt.subplot(212)
plt.scatter(X_Features_2D['RE/TA'], Y_Response, c="k", label="training samples(data)")
plt.scatter(X_Features_2D['RE/TA'], y_3, c="r", label="n_estimators=300", linewidth=2)
#plt.ylabel('Prediction target', fontsize=12)
plt.xlabel('Retained Earnings/Total Assets', fontsize=14)
plt.title('Decision Tree CLASSIFIER', fontsize=14)
plt.show()
Retained Earnings/Total Assets feature.
AdaBoostRegressor clearly created random-looking estimators (and struggled to predict Defaults?)
Decision Tree Classifier has such an elaborate fitted decision tree (above) that it predicts every in-sample point pretty much exactly.
plt.clf() #to clean the canvas
fig = plt.figure(figsize=(18,10))
plt.subplot(221)
plt.scatter(X_Features_2D['ME/TL'], Y_Response, c="k", label="training samples (data)")
plt.scatter(X_Features_2D['ME/TL'], y_1, c="g", label="n_estimators=1", linewidth=2)
#plt.ylabel('Prediction target', fontsize=12)
plt.xlabel('Market Equity (Cap)/Total Liabilities', fontsize=14)
plt.title('Decision Tree Regression', fontsize=14)
plt.subplot(222)
plt.scatter(X_Features_2D['ME/TL'], Y_Response, c="k", label="training samples (data)")
plt.scatter(X_Features_2D['ME/TL'], y_2, c="r", label="n_estimators=300", linewidth=2)
#plt.ylabel('Prediction target', fontsize=12)
plt.xlabel('Market Equity (Cap)/Total Liabilities', fontsize=14)
plt.title('BOOSTED Decision Tree Regression', fontsize=14)
plt.show()
Market Equity (Cap)/Total Liabilities feature.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
X = X_Features_2D #X_Features.loc[:,['RE/TA']]
y = Y_Response
# Create and fit an AdaBoosted decision tree
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=200)
bdt.fit(X, y) #bdt for Boosted Decision Tree
bdt.predict_proba(X)
# X is a DataFrame, so positional indexing such as X[:, 0] fails; index via .to_numpy()
plot_colors = "br"
plot_step = 0.02
class_names = "SD"
# Plot the decision boundaries
plt.figure(figsize=(20, 10))
x_min, x_max = X.to_numpy()[:, 0].min() - 1, X.to_numpy()[:, 0].max() + 1
y_min, y_max = X.to_numpy()[:, 1].min() - 1, X.to_numpy()[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
np.arange(y_min, y_max, plot_step))
Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")
# Plot the training points
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)[0]
    plt.scatter(X.to_numpy()[idx, 0], X.to_numpy()[idx, 1],
                c=c, cmap=plt.cm.Paired,
                s=20, edgecolor='k',
                label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right', fontsize=14)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Decision Boundary', fontsize=14)
# Plot the two-class decision scores
class_names = "SD"
plt.figure(figsize=(15, 10))
twoclass_output = bdt.decision_function(X)
plot_range = (twoclass_output.min(), twoclass_output.max())
for i, n, c in zip(range(2), class_names, plot_colors):
    plt.hist(twoclass_output[y == i],
             bins=10,
             range=plot_range,
             facecolor=c,
             label='Class %s' % n,
             alpha=.5,
             edgecolor='k')
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, y1, y2 * 1.2))
plt.legend(loc='upper right', fontsize=14)
plt.ylabel('Samples', fontsize=14)
plt.xlabel('Score', fontsize=14)
plt.title('Decision Scores', fontsize=14)
plt.tight_layout()
plt.subplots_adjust(wspace=0.35)
plt.show()
The predicted class label for each sample is determined by the sign of its decision score. Samples with decision scores greater than zero are classified as DEFAULT, and are otherwise classified as SURVIVAL. The magnitude of a decision score reflects the degree of confidence in the predicted class label.
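A quick self-contained check (toy data, not the Corporate PD set) that sklearn's predict() agrees with the sign of decision_function() for a two-class AdaBoost:

```python
# For a binary AdaBoostClassifier, predict() and the sign of
# decision_function() give the same labels
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

scores = clf.decision_function(X_demo)   # >0 -> class 1, <=0 -> class 0
labels = clf.predict(X_demo)
print(np.all((scores > 0) == (labels == 1)))
```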
A Decision Tree Classifier limits itself to the set of class labels. Decision Tree Regression is unlike that and provides a continuous value for the dependent variable.
y_4 = bdt.predict(X)
plt.clf() # clear the current figure
plt.figure(figsize=(12, 10))
plt.scatter(X_Features_2D['RE/TA'], Y_Response, c="k", label="training samples")
plt.scatter(X_Features_2D['RE/TA'], y_4, c="r", label="n_estimators=200", linewidth=2)
plt.xlabel("Feature: Retained Earnings/Total Assets", fontsize=14)
plt.ylabel("target", fontsize=14)
plt.title("BOOSTED Decision Tree Classifier: Predictions for Corporate Default", fontsize=14) #plt.colorbar() ADD ON NEXT RUN
plt.legend()
plt.show()
TAKE AWAYS FROM TESTING OF ALL CLASSIFIERS
The most significant feature is Retained Earnings/Total Assets (RE/TA). Although the dataset contains plenty of negative retained earnings (annual losses), the default signal given by the fitted sigmoid comes too late: the negative Retained Earnings (loss) has to be very large before the model classifies an already obvious case of default.
Chosen for its negative relationship with default, the Market Equity (Cap)/Total Liabilities (ME/TL) feature is an example of an exogenous variable: large capitalisation is associated with survival.
Confusion matrices help to examine prediction accuracy within each class. As soon as we separated the dataset into train/test halves, Logistic Regression MISCLASSIFIED MOST defaults (eg 69 of 72). We also noticed that our Corporate PD.xls data is imbalanced: only 2% of companies defaulted.
Cross-validation proves unhelpful: regardless of how we draw the sample, the mistakes in predicting defaults barely affect the total accuracy score, which remains about 98%.
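One hedged mitigation (among several: resampling, class weights, different metrics) is to score with balanced accuracy, the mean of per-class recalls, and to fit with class_weight='balanced'. Synthetic imbalanced data is used below to illustrate:

```python
# Balanced accuracy and class_weight='balanced' on a ~2%-positive sample,
# so the minority class is not drowned out by plain accuracy
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

X_imb, y_imb = make_classification(n_samples=4000, weights=[0.98],
                                   flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_imb, y_imb, stratify=y_imb,
                                          random_state=0)
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight='balanced').fit(X_tr, y_tr)
bal_p = balanced_accuracy_score(y_te, plain.predict(X_te))
bal_w = balanced_accuracy_score(y_te, weighted.predict(X_te))
print('plain   ', round(accuracy_score(y_te, plain.predict(X_te)), 3), round(bal_p, 3))
print('balanced', round(accuracy_score(y_te, weighted.predict(X_te)), 3), round(bal_w, 3))
```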
To tackle the issue, we created a machine learning 'pipeline' that fed a Logistic Classifier object into a bagging meta-estimator. Bagging fits the model (often a decision tree) on multiple sub-samples; each competing model makes a prediction, and these predictions are aggregated. That allowed us to predict up to half of the defaults correctly.
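That pipeline can be sketched with sklearn's BaggingClassifier wrapped around a logistic model (toy data here; the Lab applies it to the Corporate PD set):

```python
# Bagging of a logistic model: each member is fitted on a bootstrap
# sub-sample and the members' predictions are aggregated
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=500, weights=[0.9],
                                     random_state=0)
bag = BaggingClassifier(LogisticRegression(max_iter=1000),
                        n_estimators=50,   # 50 bootstrap sub-samples
                        max_samples=0.5,   # each member sees half the data
                        random_state=0)
bag.fit(X_demo, y_demo)
print(round(bag.score(X_demo, y_demo), 3), len(bag.estimators_))
```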
Visualisation of the Decision Tree Regressor revealed how it starts building the tree by splitting on the ME/TL feature. Clustering with Radviz 'spring tension' plots confirmed that small-cap companies tend to have a higher rate of default.
It is important to use the Decision Tree Regressor because the Decision Tree Classifier grows a very elaborate tree that predicts the data points on which it was trained almost perfectly.
The open question remains whether the fitted Decision Tree will be as effective on OUT-OF-SAMPLE data. Are we any closer to predicting defaults, specifically for the large-cap companies that are in trouble?
END OF DEMO
Generalised Linear Models http://scikit-learn.org/stable/modules/linear_model.html
Support Vector Machines http://scikit-learn.org/stable/modules/svm.html
Ensemble Methods http://scikit-learn.org/stable/modules/ensemble.html#adaboost
Slidepacks from the Data Science and Machine Learning I, II Modules' lectures; remember you will have access to Electives that also have ML/Time Series/Python content. This ML Lab's notebook has other useful links throughout, and Additional Python.zip (or a similar name on the CQF Portal) has Python notebooks from the ML Elective and two other relevant Workshops (by Dr Yves Hilpisch).
ADVICE While mathematical clarity is helpful, ML and AI can be very difficult to learn from textbooks alone.
You might have a perfect outline of a method (eg, Ridge and Lasso regressions) but not realise that none of it is applicable (or makes much difference) for financial time series, such as prices of market indices, ETFs, and individual equities and commodities.
Equally, you might learn everything proper about Cross-validation and Bootstrap samples (Elements of Statistical Learning) but realise that none of it matters in a situation where there is not much default data.
Moody's Default & Recovery Database
Includes more than 500,000 individual debt securities, both corporate and sovereign entities, and default history starting from 1920.
Flexibility: Provides universal identifiers, such as CUSIPs and SIC codes; primary keys for complex querying; and classifications for debt type, rating type, and region, as well as flags.
Altman Z-Score+ analysis of Boeing Company (Ticker : BA)
NYU Research Centers
We have now over 3,700 observations on defaulted bond issues covering the period 1971-2016. Data includes SIC, CUSIP, company identification (when available), bond description, date of default, price at default (Recovery), original bond rating, seniority, type of default (bankruptcy, missed interest payment, or distressed exchange) and more.
A complete list of corporate bankruptcies under either Chapter X and XII (1971-1978) or Chapter 11 (1979-present) of firms with liabilities equal to or greater than $100 million at time of filing. As of March 2017, the list includes over 2,260 bankruptcies with firm identifiers (when available), bankruptcy date, size of liabilities and SIC.
Defaulted bond prices are available from 1987 to present (2,078 bonds as of February 2017); defaulted bank loan prices on a monthly basis are available from 1996 through the present (861 loan facilities as of February 2017).
http://pages.stern.nyu.edu/~ealtman/Credit%20&%20Debt%20Markets%20Databases.htm